sentence and score label. Read the dataset specification for details. Use the helper functions in the first lab session's folder (note they may need modification) or create your own. You can submit your homework following these guidelines: Git Intro & How to hand in your homework. Make sure to commit and push your changes to your repository BEFORE the deadline (Tuesday, Oct. 29th, 11:59 pm).
### Begin Assignment Here
# necessary for when working with external scripts
%load_ext autoreload
%autoreload 2
We start by importing our libraries and helpers and preparing the take-home exercises dataset.
GitHub does not render the newer versions of the plotly library well; please refer to Rendered Homework 1 if graphs are missing.
# import libraries
import pandas as pd
import numpy as np
import nltk
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
import plotly as py
import math
%matplotlib inline
from sklearn import preprocessing, metrics, decomposition, pipeline, dummy
# data visualization libraries
import matplotlib.pyplot as plt
from plotly import tools
import seaborn as sns
from mpl_toolkits import mplot3d
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# Dimensionality Reduction
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.stats import pearsonr  # the scipy.stats.stats path is deprecated
from sklearn.naive_bayes import MultinomialNB
##--- Take-home exercises setup ---##
# prepare dataset
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
# my functions
import helpers.data_mining_helpers as dmh
# construct dataframe from a list
X = pd.DataFrame.from_records(dmh.format_rows(twenty_train), columns= ['text'])
# add category to the dataframe
X['category'] = twenty_train.target
# add category label also
X['category_name'] = X.category.apply(lambda t: dmh.format_labels(t, twenty_train))
Experiment with other querying techniques using pandas dataframes.
# query the first 15 records where category is either 1, 2 or 3
X.query('(category == [1, 2, 3])')[0:15]
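The same list-membership query can also be written with boolean indexing; a minimal sketch on a toy frame (the frame below is illustrative, not the homework data):

```python
import pandas as pd

df = pd.DataFrame({'category': [0, 1, 2, 3, 1, 2],
                   'text': list('abcdef')})
# query() with == against a list behaves like isin(): both keep rows
# whose category is one of the listed values
q = df.query('category == [1, 2, 3]')
m = df[df['category'].isin([1, 2, 3])]
```

Both forms return the same rows; `isin()` avoids the string-expression parsing that `query()` does.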
Please check the data and the process below, describe what you observe and why it happened. $Hint$ : why .isnull() didn't work?
NA_dict = [{ 'id': 'A', 'missing_example': np.nan },
{ 'id': 'B' },
{ 'id': 'C', 'missing_example': 'NaN' },
{ 'id': 'D', 'missing_example': 'None' },
{ 'id': 'E', 'missing_example': None },
{ 'id': 'F', 'missing_example': '' }]
NA_df = pd.DataFrame(NA_dict, columns = ['id','missing_example'])
NA_df
NA_df['missing_example'].isnull()
# Answer here
'''
The DataFrame.isnull() method only returns True for missing values:
NA values such as None or numpy.NaN get mapped to True; everything else gets mapped to False.
In this case the explicitly declared strings 'NaN', 'None' and the empty string '' evaluate to False
because the method has no way of knowing what the value inside the string represents.
'''
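If we do want the string sentinels to count as missing, we can map them to real NA values first; a minimal sketch on a toy column:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 'NaN', 'None', '', None])
# isnull() only flags real NA values (np.nan, None), not the strings
raw_mask = s.isnull()
# map the string sentinels to np.nan first, then isnull() catches them too
cleaned = s.replace({'NaN': np.nan, 'None': np.nan, '': np.nan})
clean_mask = cleaned.isnull()
```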
X_sample = X.sample(n=1000) #random state
len(X_sample)
X_sample[0:4]
Notice any changes to the X dataframe? What are they? Report every change you noticed as compared to the previous state of X. Feel free to query and look more closely at the dataframe for these changes.
# Answer here
'''
Our original dataset X is not affected by the DataFrame.sample() method.
The method only creates a copy of randomly selected rows, which is then assigned to X_sample.
Since the rows are sampled randomly, the result is not sorted; if we want to sort by index in
ascending order we can call X_sample.sort_index().
'''
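A quick check of both claims on a toy frame (the names here are illustrative): sampling leaves the source untouched, and `sort_index()` restores ascending index order.

```python
import pandas as pd

df = pd.DataFrame({'v': range(10)})
sample = df.sample(n=5, random_state=0)
# the original frame is untouched by sampling
original_len = len(df)
# sort_index() restores ascending index order on the sample
sorted_sample = sample.sort_index()
```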
We can also do a side-by-side comparison of the distribution between the two datasets, but maybe you can try that as an exercise.
# Answer here
sample_counts = X_sample.category_name.value_counts()
actual_counts = X.category_name.value_counts()
combined_data_frame = pd.DataFrame({'dataset': actual_counts,
'sample': sample_counts}, index = categories)
print(combined_data_frame.plot.bar(title = 'Category Distribution', rot = 0, fontsize = 12, figsize = (8,4)))
We said that the 1 at the beginning of the fifth record represents the term 00. Notice that there is another 1 in the same record. Can you provide code that verifies which word this 1 represents in the vocabulary? Try to do this as efficiently as possible.
count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(X.text)
# Answer here
array = X_counts[4:5, 0:100].toarray() #obtain the fifth record
'''
We can print all the words contained in the sentence among the first 100 vocabulary terms;
the second word printed corresponds to the second 1 in the array.
'''
for word in count_vect.inverse_transform(array)[0]:
    print('word: %s' % word)
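A more direct lookup, sketched here on toy documents, uses the fitted `vocabulary_` mapping (term to column index) instead of `inverse_transform`:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["00 is a token", "another 01 token"]
cv = CountVectorizer()
counts = cv.fit_transform(docs)
# vocabulary_ maps term -> column index; invert it once for O(1) lookups
index_to_term = {idx: term for term, idx in cv.vocabulary_.items()}
row = counts[0].toarray()[0]
# names of the terms with a non-zero count in the first document
nonzero_terms = [index_to_term[j] for j in row.nonzero()[0]]
```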
# Answer here
'''
We can use a sample of the documents to create a smaller term-document matrix, which we can plot to observe
that some terms are repeated more often than others.
We can also remove vmax so that different values in the term-document matrix map to different colors, and remove the
number labels inside the heatmap to make it less cluttered.
'''
n = 150
sample_X = X.sample(n=n, random_state = 26)
sample_count_vect = CountVectorizer()
sample_counts = sample_count_vect.fit_transform(sample_X.text)
plot_x = ["term_"+str(i) for i in sample_count_vect.get_feature_names()[0:n]]
# obtain document index
plot_y = ["doc_"+ str(i) for i in list(sample_X.index)[:n]]
plot_z = sample_counts[0:n, 0:n].toarray()
df_todraw = pd.DataFrame(plot_z, columns = plot_x, index = plot_y)
plt.subplots(figsize=(18, 14))
ax = sns.heatmap(df_todraw,
cmap="PuRd",
vmin=0, annot=False)
Please try to reduce the dimension to 3 and plot the result using a 3-D plot. Use at least 3 different angles (camera positions) to check your result and describe what you found.
$Hint$: you can refer to Axes3D in the documentation.
# Answer here
X_reduced3 = PCA(n_components = 3).fit_transform(X_counts.toarray())
print('Dimension:')
print(X_reduced3.shape)
col = ['coral', 'blue', 'black', 'm']
fig = plt.figure(figsize = (25,10))
ax1 = fig.add_subplot(2,2,1, projection='3d')
ax2 = fig.add_subplot(2,2,2, projection='3d')
ax3 = fig.add_subplot(2,2,3, projection='3d')
ax4 = fig.add_subplot(2,2,4, projection='3d')
for c, category in zip(col, categories):
    xs = X_reduced3[X['category_name'] == category].T[0]
    ys = X_reduced3[X['category_name'] == category].T[1]
    zs = X_reduced3[X['category_name'] == category].T[2]
    ax1.scatter3D(xs, ys, zs, c= c, marker = 'o')
    ax2.scatter3D(xs, ys, zs, c= c, marker = 'o')
    ax3.scatter3D(xs, ys, zs, c= c, marker = 'o')
    ax4.scatter3D(xs, ys, zs, c= c, marker = 'o')
ax1.grid(color='gray', linestyle=':', linewidth=2, alpha=0.2)
ax1.set_xlabel('\nX Label')
ax1.set_ylabel('\nY Label')
ax1.set_zlabel('\nZ Label')
ax1.view_init(0, 0)
ax2.grid(color='gray', linestyle=':', linewidth=2, alpha=0.2)
ax2.set_xlabel('\nX Label')
ax2.set_ylabel('\nY Label')
ax2.set_zlabel('\nZ Label')
ax2.view_init(90, 0)
ax3.grid(color='gray', linestyle=':', linewidth=2, alpha=0.2)
ax3.set_xlabel('\nX Label')
ax3.set_ylabel('\nY Label')
ax3.set_zlabel('\nZ Label')
ax3.view_init(0, 90)
ax4.grid(color='gray', linestyle=':', linewidth=2, alpha=0.2)
ax4.set_xlabel('\nX Label')
ax4.set_ylabel('\nY Label')
ax4.set_zlabel('\nZ Label')
ax4.view_init(30, 45)
plt.show()
Observations: the data appears more spread out in the z-y plane and more compact in the z-x plane.
Interactive visualization of term frequencies
term_frequencies = np.asarray(X_counts.sum(axis=0))[0]
# Answer here
data = go.Bar(x = ["term_"+str(i) for i in count_vect.get_feature_names()[0:300]],
y=term_frequencies[:300])
fig = go.Figure(data)
fig.update_layout(
title=go.layout.Title(
text="Term Frequencies",
xref="paper",
x=0
)
)
fig.show()
Visualization of reduced number of terms
# Answer here
term_frequencies_df = pd.DataFrame({'terms': count_vect.get_feature_names(),
'counts': term_frequencies})
sample_term_frequencies_df = term_frequencies_df.sample(n=100, random_state=26)
sample_data = go.Bar(x = ["term_"+str(i) for i in sample_term_frequencies_df['terms']],
y=sample_term_frequencies_df['counts'])
fig = go.Figure(sample_data)
fig.update_layout(
title=go.layout.Title(
text="Sample Data Terms",
xref="paper",
x=0
)
)
fig.show()
Sort the terms on the x-axis by frequency instead of in alphabetical order.
# Answer here
# for efficiency we will use a sample of the dataset
#order the terms
ordered_term_frequencies_df = sample_term_frequencies_df.sort_values(by = 'counts', ascending = False)
#generate graph
ordered_data = go.Bar(x=["term_"+str(i) for i in ordered_term_frequencies_df['terms']],
y=ordered_term_frequencies_df['counts'])
fig = go.Figure(ordered_data)
fig.update_layout(
title=go.layout.Title(
text="Long-tailed distribution in Sample",
xref="paper",
x=0
)
)
fig.show()
Try to generate the binarization using the category_name column instead. Does it work?
# Answer here
mlb = preprocessing.LabelBinarizer()
mlb.fit(X.category)
X['bin_category'] = mlb.transform(X['category']).tolist()
X[:10]
mlb.fit(X.category_name)
X['bin_category'] = mlb.transform(X['category_name']).tolist()
X[:10]
Observation: binarization with the category_name column returns exactly the same result as binarization with category.
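This is no coincidence: fetch_20newsgroups assigns numeric codes in alphabetical order of the category names, so the binarizer's sorted class lists line up. A minimal sketch with the four categories:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

names = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
codes = [0, 3, 1, 2]  # numeric codes follow alphabetical order of the names
bin_names = LabelBinarizer().fit_transform(names)
bin_codes = LabelBinarizer().fit_transform(codes)
# same one-hot matrix either way
same = np.array_equal(bin_names, bin_codes)
```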
In this section we perform different operations with the new data and also present the data visualizations from task 3.
#load data into python array
sentiment_data_array = []
with open("sentiment_labelled_sentences/amazon_cells_labelled.txt","r") as amazon_data:
    sentiment_data_array += [string + '\tamazon' for string in amazon_data.read().split('\n')]
with open("sentiment_labelled_sentences/imdb_labelled.txt","r") as imdb_data:
    sentiment_data_array += [string + '\timdb' for string in imdb_data.read().split('\n')]
with open("sentiment_labelled_sentences/yelp_labelled.txt","r") as yelp_data:
    sentiment_data_array += [string + '\tyelp' for string in yelp_data.read().split('\n')]
#create dictionary with the array
sentiment_data = dmh.sentiment_data_dictionary(sentiment_data_array)
# construct dataframe from the created dictionary
sentiment_data_df = pd.DataFrame.from_records(data = {"sentence":sentiment_data['sentences'], "score":sentiment_data['scores'], "source":sentiment_data['sources']})
Print the first 10 records from the dataframe
# first 10 records from the dataframe
sentiment_data_df[:10]
Print the last 10 sentences
# last 10 records keeping only sentence and source column
sentiment_data_df[-10:][["sentence", "source"]]
Query every 10th record; the first 10 such records are printed.
# using iloc (by position)
# query every 10th record in our dataframe; the query also contains the first 10 records
sentiment_data_df.iloc[::10, 0:2][0:10]
Check for missing values
sentiment_data_df.isnull().apply(lambda x: dmh.check_missing_values(x))
There are no missing values in the dataset. To make sure no missing values slip in, the sentiment_data_dictionary method in the helpers file ignores any row whose sentence or score is missing.
In this particular case it is better to ignore rows containing a missing value: there is no way to reconstruct a missing sentence, and guessing a 0 or 1 score would only contaminate the data.
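Had the helper not filtered the rows, the same policy could be applied after loading with `dropna`; a sketch on toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'sentence': ['great phone', None, 'awful screen'],
                   'score': [1, 0, np.nan]})
# drop any row where either the sentence or the score is missing
clean = df.dropna(subset=['sentence', 'score'])
```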
Check for duplicated rows
"""
check if there are duplicated rows and remove them
"""
duplicates = sum(sentiment_data_df.duplicated('sentence'))
print('Number of rows before cleaning: %d' % len(sentiment_data_df))
print('Duplicated rows: %d' % duplicates)
if duplicates > 0:
    sentiment_data_df.drop_duplicates(keep=False, inplace=True)
    # resetting the index is necessary because after dropping duplicates the index
    # keeps its old values, and accessing rows with [] would return the wrong rows
    sentiment_data_df.reset_index(drop=True, inplace=True)
print('Rows after dropping duplicates: %d' % len(sentiment_data_df))
Sampling
#sampling
records = 700
sentiment_data_sample = sentiment_data_df.sample(n=records, random_state=26)
#show the sampling and actual data counts in a bar graph
sample_counts = sentiment_data_sample.score.value_counts()
actual_counts = sentiment_data_df.score.value_counts()
combined_data_frame = pd.DataFrame({'data': actual_counts,
'sample': sample_counts})
print(combined_data_frame.plot.bar(title = 'Sentiment Distribution', rot = 0, fontsize = 12, figsize = (8,4), tick_label = ['negative', 'positive']))
Pie chart visualization
labels = sentiment_data_df.source.value_counts().index
values = sentiment_data_df.source.value_counts()
fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
fig.update_layout(
title=go.layout.Title(
text="Sentences Source",
xref="paper",
x=0
)
)
fig.show()
Scatter plot Visualization
#scatter plot visualization
#show the relation between the word count in each sentence and what sentiment it is attached to
sentiment_data_sample_2 = sentiment_data_df.sample(n = 100)
x_axis_array = ["sentence_" + str(index) for index in sentiment_data_sample_2.index]
fig = go.Figure()
# Add traces
fig.add_trace(go.Scatter(x=x_axis_array, y=[len(sentence.split(' ')) for sentence in sentiment_data_sample_2.sentence],
mode='lines+markers',
name='Sentence Word Count'))
fig.add_trace(go.Scatter(x=x_axis_array, y=[score for score in sentiment_data_sample_2.score],
mode='lines+markers',
name='Sentence Score'))
fig.update_layout(
title=go.layout.Title(
text="Scatter Plot",
xref="paper",
x=0
)
)
fig.show()
positive_sentences_array = [len(row.sentence.split(' ')) for index, row in sentiment_data_sample_2.iterrows() if row.score == '1']
print('Average word count in each positive sentence: ', sum(positive_sentences_array)/len(positive_sentences_array))
negative_sentences_array = [len(row.sentence.split(' ')) for index, row in sentiment_data_sample_2.iterrows() if row.score == '0']
print('Average word count in each negative sentence: ', sum(negative_sentences_array)/len(negative_sentences_array))
Feature Creation
Data frame with unigrams:
sentiment_data_df['unigrams'] = sentiment_data_df['sentence'].apply(dmh.tokenize_text)
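dmh.tokenize_text is a course helper we don't reproduce here; a minimal stand-in, assuming it just lowercases a sentence and splits it into word unigrams:

```python
import re

def tokenize_text(text):
    # hypothetical stand-in for the helper: lowercase word unigrams
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize_text("The battery life is great!")
```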
Feature Subset Selection
Document-Term matrix:
count_vect = CountVectorizer()
frequency_counts = count_vect.fit_transform(sentiment_data_df.sentence)
print("Document Term Matrix Size:", frequency_counts.shape)
Heatmap visualization
# heatmap visualization
sample_count_vect = CountVectorizer()
sample_counts = sample_count_vect.fit_transform(sentiment_data_sample.sentence)
plot_x = ["term_"+str(i) for i in sample_count_vect.get_feature_names()[0:records]]
plot_y = ["doc_"+ str(i) for i in list(sentiment_data_sample.index)[:records]]
plot_z = sample_counts[0:records, 0:records].toarray()
df_todraw = pd.DataFrame(plot_z, columns = plot_x, index = plot_y)
plt.subplots(figsize=(18, 14))
ax = sns.heatmap(df_todraw,
cmap="PuRd",
vmin=0, annot=False)
Dimensionality Reduction
PCA with 2 components along with its scatter plot:
#2 dimension PCA
colors = ['coral', 'blue']
scores = ['0','1']
sentiment_data_reduced2 = PCA(n_components = 2).fit_transform(frequency_counts.toarray())
fig = plt.figure(figsize = (25,10))
ax = fig.subplots()
for c, score in zip(colors, scores):
    xs = sentiment_data_reduced2[sentiment_data_df['score'] == score].T[0]
    ys = sentiment_data_reduced2[sentiment_data_df['score'] == score].T[1]
    ax.scatter(xs, ys, c = c, marker='o')
ax.grid(color='gray', linestyle=':', linewidth=2, alpha=0.2)
ax.set_xlabel('\nX Label')
ax.set_ylabel('\nY Label')
plt.show()
PCA with 3 components along with its scatter plot:
#3 dimension PCA
sentiment_data_reduced3 = PCA(n_components = 3).fit_transform(frequency_counts.toarray())
fig = plt.figure(figsize = (25,10))
ax1 = fig.add_subplot(1,1,1, projection='3d')
for c, score in zip(colors, scores):
    xs = sentiment_data_reduced3[sentiment_data_df['score'] == score].T[0]
    ys = sentiment_data_reduced3[sentiment_data_df['score'] == score].T[1]
    zs = sentiment_data_reduced3[sentiment_data_df['score'] == score].T[2]
    ax1.scatter3D(xs, ys, zs, c= c, marker = 'o')
ax1.grid(color='gray', linestyle=':', linewidth=2, alpha=0.2)
ax1.set_xlabel('\nX Label')
ax1.set_ylabel('\nY Label')
ax1.set_zlabel('\nZ Label')
ax1.view_init(25, 45)
plt.show()
Attribute Transformation/Aggregation
Creating an Array that contains the sum of the term frequencies for each word. Bar graph visualization (zoom for details).
#create a term frequencies array
term_frequencies = np.asarray(frequency_counts.sum(axis=0))[0]
data = go.Bar(x=["term_"+str(i) for i in count_vect.get_feature_names()],
y=term_frequencies)
fig = go.Figure(data)
fig.update_layout(
title=go.layout.Title(
text="Term Frequencies",
xref="paper",
x=0
)
)
fig.show()
Long-tailed distribution of term frequencies. Bar graph visualization, term frequencies in descending order (zoom for details).
#ordered term frequencies visualization
term_frequencies_df = pd.DataFrame({'terms': count_vect.get_feature_names(),
'counts': term_frequencies})
ordered_term_frequencies_df = term_frequencies_df.sort_values(by = 'counts', ascending = False)
ordered_data = go.Bar(x=["term_"+str(i) for i in ordered_term_frequencies_df['terms']],
y=ordered_term_frequencies_df['counts'])
fig = go.Figure(ordered_data)
fig.update_layout(
title=go.layout.Title(
text="Ordered Term Frequencies",
xref="paper",
x=0
)
)
fig.show()
Binarization
The score column has only 2 possible values, so it makes more sense to apply binarization to the source column.
mlb = preprocessing.LabelBinarizer()
mlb.fit(sentiment_data_df.source)
sentiment_data_df['bin_source'] = mlb.transform(sentiment_data_df['source']).tolist()
# print the first 10 rows
sentiment_data_df[0:10]
Print the last 10 rows
sentiment_data_df[-10:]
#TF-IDF
tf_idf_vect = TfidfVectorizer()
tf_idf_counts = tf_idf_vect.fit_transform(sentiment_data_df.sentence)
print('First 10 Feature Names:', tf_idf_vect.get_feature_names()[0:10])
print('TF-IDF Matrix Size:', tf_idf_counts.shape)
TF IDF vs Term Frequency comparison visualization
#visualize the 25 terms with the highest values in the TF-IDF Matrix and compare it with the highest values in the
#Count Frequency Matrix
n = 25
term_tf_idf = np.asarray(tf_idf_counts.sum(axis=0))[0]
term_tf_idf_df = pd.DataFrame({'terms': tf_idf_vect.get_feature_names(),
'counts': term_tf_idf})
ordered_term_tf_idf_df = term_tf_idf_df.sort_values(by = 'counts', ascending = False)
fig = make_subplots(rows=1, cols=2)
fig.add_trace(
go.Bar(
x=["term_"+str(i) for i in ordered_term_tf_idf_df['terms']][:n],
y=ordered_term_tf_idf_df['counts'][:n],
name = "TF-IDF"),
row=1, col=1
)
fig.add_trace(
go.Bar(
x=["term_"+str(i) for i in ordered_term_frequencies_df['terms'][:n]],
y=ordered_term_frequencies_df['counts'][:n],
name = "Word Counts"),
row=1, col=2
)
fig.update_layout(height=700, width=900, title_text="TF-IDF Ordered Data Terms vs Counts Ordered Data Terms")
fig.show()
#Naive Bayes classifier
#term frequency
mnb_term_frequency = MultinomialNB()
mnb_term_frequency.fit(frequency_counts, sentiment_data_df['score'].values)
#TF-IDF
mnb_tf_idf = MultinomialNB()
mnb_tf_idf.fit(tf_idf_counts, sentiment_data_df['score'].values)
#testing the accuracy of the classifier
#test with a single random sentence from the data set
r_sentence = sentiment_data_df.sample(n = 1)
print('Sentence:',r_sentence.iloc[0].sentence,'\nReal Score:', r_sentence.iloc[0].score)
r_sentence_index = r_sentence.index[0]
print('Term Frequency MNB Prediction: ', mnb_term_frequency.predict(frequency_counts[r_sentence_index:r_sentence_index+1])[0])
print('TF-IDF MNB Prediction: ', mnb_tf_idf.predict(tf_idf_counts[r_sentence_index:r_sentence_index+1])[0])
#obtain the accuracy of both models comparing the predicted values with the actual scores of all the sentences in the data set
total_sentences = len(sentiment_data_df)
tf_correct_prediction = 0
tf_idf_correct_prediction = 0
for index, row in sentiment_data_df.iterrows():
tf_prediction = mnb_term_frequency.predict(frequency_counts[index:index+1])[0]
if row.score == tf_prediction:
tf_correct_prediction+=1
tf_idf_prediction = mnb_tf_idf.predict(tf_idf_counts[index:index+1])[0]
if row.score == tf_idf_prediction:
tf_idf_correct_prediction+=1
print('Term Frequency MNB Prediction Accuracy: %.2f%%' % ((tf_correct_prediction/total_sentences)*100))
print('TF-IDF MNB Prediction Accuracy: %.2f%%' % ((tf_idf_correct_prediction/total_sentences)*100))
# testing with sentences that are not in the dataset
negative_sentence = "I was very disappointed with the service"
n_word_freq = count_vect.transform([negative_sentence]).toarray()
n_word_tf_idf = tf_idf_vect.transform([negative_sentence]).toarray()
print('Term Frequency MNB Prediction for negative sentence: ', mnb_term_frequency.predict(n_word_freq[0:1])[0])
print('TF-IDF MNB Prediction for negative sentence: ', mnb_tf_idf.predict(n_word_tf_idf[0:1])[0])
positive_sentence = "Highly recommended, the service is very good"
p_word_freq = count_vect.transform([positive_sentence]).toarray()
p_word_tf_idf = tf_idf_vect.transform([positive_sentence]).toarray()
print('Term Frequency MNB Prediction for positive sentence: ', mnb_term_frequency.predict(p_word_freq[0:1])[0])
print('TF-IDF MNB Prediction for positive sentence: ', mnb_tf_idf.predict(p_word_tf_idf[0:1])[0])
# the following is not necessary
## we can achieve the same result with just print(twenty_train.data[0])
print("\n".join(twenty_train.data[0].split("\n")))
# using a simple print()
print(twenty_train.data[0])
# here term_frequencies is computed twice: the loop result is immediately
# overwritten by the vectorized column sum below
for j in range(0, X_counts.shape[1]):
    term_frequencies.append(sum(X_counts[:,j].toarray()))
term_frequencies = np.asarray(X_counts.sum(axis=0))[0]
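The vectorized form replaces the Python loop entirely; a quick sketch on a tiny sparse matrix shows the column-wise sum:

```python
import numpy as np
from scipy.sparse import csr_matrix

X = csr_matrix(np.array([[1, 0, 2], [0, 3, 1]]))
# one vectorized call: sum each column (term) over all documents (rows)
term_freq = np.asarray(X.sum(axis=0))[0]
```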
For further exploration we also looked at similarities between sentiments. Five random pairs of sentences with the same sentiment are compared; using sentences with the same sentiment increases the chance of having common words.
#distance similarity
for i in range(0, 5):
    # show cosine similarity and Pearson's correlation coefficient of 2 random documents with negative sentiment (score == 0)
    r_negative_sentences = sentiment_data_df[sentiment_data_df['score'] == '0'].sample(n = 2)
    print('Sentences:')
    for sentence in r_negative_sentences.sentence:
        print('-%s' % sentence)
    index1 = r_negative_sentences.index[0]
    index2 = r_negative_sentences.index[1]
    # obtain the rows in the count frequency matrix corresponding to those indexes
    count_row1 = frequency_counts[index1:index1+1]
    count_row2 = frequency_counts[index2:index2+1]
    # obtain the rows in the TF-IDF matrix corresponding to those indexes
    tf_idf_row1 = tf_idf_counts[index1:index1+1]
    tf_idf_row2 = tf_idf_counts[index2:index2+1]
    # cosine similarity
    print("Cosine Similarity using term count:", cosine_similarity(count_row1, count_row2)[0][0])
    print("Cosine Similarity using TF-IDF:", cosine_similarity(tf_idf_row1, tf_idf_row2)[0][0])
    # Pearson's correlation coefficient
    print("Pearson's correlation coefficient using term count:", pearsonr(count_row1.toarray().ravel(), count_row2.toarray().ravel())[0])
    print("Pearson's correlation coefficient using TF-IDF:", pearsonr(tf_idf_row1.toarray().ravel(), tf_idf_row2.toarray().ravel())[0])
    # Extended Jaccard coefficient
    print("Extended Jaccard Coefficient using term count:", dmh.extended_jaccard_coefficient(count_row1.toarray().ravel(), count_row2.toarray().ravel()))
    print("Extended Jaccard Coefficient using TF-IDF:", dmh.extended_jaccard_coefficient(tf_idf_row1.toarray().ravel(), tf_idf_row2.toarray().ravel()))
    print("\n")
Most similarities will be zero (or close to zero) because the sentences share very few words, even when they carry the same sentiment.
If we compute the similarity coefficients for 2 sentences that we know have at least one word in common, we observe a non-zero value for both the term-count vectors and the TF-IDF vectors.
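The disjoint-vocabulary effect can be checked on a toy corpus (illustrative sentences, not from the dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["terrible battery", "awful screen", "terrible screen"]
counts = CountVectorizer().fit_transform(docs)
# no shared words -> the dot product is 0, so cosine similarity is exactly 0
disjoint = cosine_similarity(counts[0], counts[1])[0][0]
# one shared word ("terrible") -> non-zero similarity
shared = cosine_similarity(counts[0], counts[2])[0][0]
```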
#known negative sentence indexes with common words
index1 = 1455
index2 = 1178
print('Sentences:')
print('-%s' % sentiment_data_df.iloc[index1].sentence)
print('-%s' % sentiment_data_df.iloc[index2].sentence)
#obtain the rows in the count frequency matrix corresponding to those indexes
count_row1 = frequency_counts[index1:index1+1]
count_row2 = frequency_counts[index2:index2+1]
#obtain the rows in the TF-IDF matrix corresponding to those indexes
tf_idf_row1 = tf_idf_counts[index1:index1+1]
tf_idf_row2 = tf_idf_counts[index2:index2+1]
#similarities
print("Cosine Similarity of term count:",cosine_similarity(count_row1, count_row2)[0][0])
print("Cosine Similarity of TF-IDF:",cosine_similarity(tf_idf_row1, tf_idf_row2)[0][0])
print("Pearson's correlation coefficient using term count:",pearsonr(count_row1.toarray().ravel(),count_row2.toarray().ravel())[0])
print("Pearson's correlation coefficient using TF-IDF:",pearsonr(tf_idf_row1.toarray().ravel(),tf_idf_row2.toarray().ravel())[0])
print("Extended Jaccard Coefficient using term count:", dmh.extended_jaccard_coefficient(count_row1.toarray().ravel(), count_row2.toarray().ravel()))
print("Extended Jaccard Coefficient using TF-IDF:", dmh.extended_jaccard_coefficient(tf_idf_row1.toarray().ravel(), tf_idf_row2.toarray().ravel()))
The similarity of the vectors using TF-IDF weights is consistently lower than the similarity of the vectors using the raw word counts.
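This is expected: TF-IDF shrinks the weight of words that occur in many documents, and those shared, common words are what drives the raw-count similarity. A sketch on toy sentences (illustrative, not from the dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the food was good", "the food was bad", "great view"]
counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfVectorizer().fit_transform(docs)
# the overlap is only in common words ("the", "food", "was"); TF-IDF
# downweights them, so the TF-IDF similarity drops below the count one
count_sim = cosine_similarity(counts[0], counts[1])[0][0]
tfidf_sim = cosine_similarity(tfidf[0], tfidf[1])[0][0]
```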